========================================================

This report explores a dataset containing quality data for 4898 white wines.

Univariate Plots Section

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Our dataset consists of 13 variables with 4898 observations. First I will look at the distribution of some of the variables through plots and distribution tables.
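The output above has the shape produced by `summary()`, `names()` and `str()` on a data frame. As a minimal sketch, the same calls can be run on the first five rows as they are printed in the `str()` output. The name `wq_head` is hypothetical, and density is only shown there to three decimals, so these rows are approximate.

```r
# First five rows, copied from the str() output above.
# Density is printed rounded to three decimals, so it is approximate here.
wq_head <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3, 3.3, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

summary(wq_head)  # quartiles and mean for every column
names(wq_head)    # column names
str(wq_head)      # column types and leading values
dim(wq_head)      # number of rows and columns
```

On the full data frame (called `wqdata` in the model call later in this report) the same four calls apply unchanged.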

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Sulphates, chlorides, residual sugar and citric acid all appear to be skewed to the right: in each case the mean exceeds the median and the maximum lies far above the third quartile.
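The skew can be checked roughly from the summary statistics alone. The sketch below copies the values printed above and computes two informal indicators: whether the mean exceeds the median, and how long the upper tail is relative to the lower tail. The `tail_ratio` column is an ad hoc measure chosen for illustration, not a standard skewness statistic.

```r
# Values copied from the four summary() outputs above.
stats <- data.frame(
  variable = c("sulphates", "chlorides", "residual.sugar", "citric.acid"),
  min      = c(0.2200, 0.00900,  0.600, 0.0000),
  median   = c(0.4700, 0.04300,  5.200, 0.3200),
  mean     = c(0.4898, 0.04577,  6.391, 0.3342),
  max      = c(1.0800, 0.34600, 65.800, 1.6600)
)

# A mean above the median is a common sign of a right skew.
stats$mean_above_median <- stats$mean > stats$median

# Ad hoc indicator: length of the upper tail relative to the lower tail.
# Values well above 1 suggest a long right tail.
stats$tail_ratio <- (stats$max - stats$median) / (stats$median - stats$min)

print(stats[, c("variable", "mean_above_median", "tail_ratio")])
```

With the raw data available, a histogram per variable would show the same thing more directly.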

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH appears to be approximately normally distributed, with no obvious skew (mean 3.188, median 3.180).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

With a reduced bin width, wine density appears to be approximately normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol does not seem to follow a recognizable distribution on either a linear or a logarithmic scale.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality is roughly normally distributed. Quality 6 has the most wines, while quality 9 has the fewest. The lowest quality is 3 and the highest quality is 9.

Univariate Analysis

What is the structure of your dataset?

There are 4898 wines in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality), plus a row index X. Other observations: the most common quality rating is 6; the median density of the wines is 0.9937; and the median alcohol content (10.40) is less than the mean alcohol content (10.51).

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the dataset are pH, density, alcohol and quality. I’d like to see how the other features influence these four.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I expect the chemical components (chlorides, sulphates, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide, fixed.acidity, volatile.acidity, residual.sugar) to affect the density and pH of the wines. In turn, I expect pH and density to be related to the alcohol content and the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I did not create any new variables in the dataset.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I did not transform any data, since none of the distributions seemed unusual enough to require it.

Bivariate Plots Section

First, I look at the correlation matrix and table of the variables to establish relationships between the variables.

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## alcohol               0.213656245   -0.12088112      0.067717943
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## alcohol              -0.075728730   -0.450631222 -0.36018871
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## alcohol                    -0.2501039415         -0.448892102 -0.78013762
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates     alcohol      quality
## X                    -0.1157741316  0.009807759  0.21365624  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.12088112 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147  0.06771794 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.07572873 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.45063122 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.36018871 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246 -0.25010394  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.44889210 -0.174737218
## density              -0.0935914935  0.074493149 -0.78013762 -0.307123313
## pH                    1.0000000000  0.155951497  0.12143210  0.099427246
## sulphates             0.1559514973  1.000000000 -0.01743277  0.053677877
## alcohol               0.1214320987 -0.017432772  1.00000000  0.435574715
## quality               0.0994272457  0.053677877  0.43557472  1.000000000

From the plot it appears that the following relationships have the strongest correlations: alcohol vs density (r ≈ -0.78), density vs residual.sugar (r ≈ 0.84), density vs total.sulfur.dioxide (r ≈ 0.53) and quality vs alcohol (r ≈ 0.44).
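One way to make this ranking explicit is to sort the pairs by absolute correlation. The sketch below uses a handful of coefficients copied (and rounded) from the matrix above, not the full matrix. The 0.5 cutoff for "strong" is an arbitrary illustrative choice.

```r
# Selected pairwise correlations, rounded from the matrix above.
pairs_r <- c(
  "quality ~ alcohol"                          =  0.436,
  "alcohol ~ density"                          = -0.780,
  "quality ~ density"                          = -0.307,
  "density ~ total.sulfur.dioxide"             =  0.530,
  "pH ~ fixed.acidity"                         = -0.426,
  "density ~ residual.sugar"                   =  0.839,
  "alcohol ~ residual.sugar"                   = -0.451,
  "free.sulfur.dioxide ~ total.sulfur.dioxide" =  0.616,
  "alcohol ~ total.sulfur.dioxide"             = -0.449
)

# Rank by strength, ignoring the sign.
ranked <- pairs_r[order(abs(pairs_r), decreasing = TRUE)]

# Keep only pairs above an illustrative cutoff of |r| >= 0.5.
strong <- ranked[abs(ranked) >= 0.5]

print(ranked)
print(names(strong))
```

Sorting by absolute value puts the negative alcohol and density pair next to the positive density and residual.sugar pair, which a sort on the signed values would separate.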

Let’s explore various relationships that alcohol has with other variables.

Fixed acidity and residual sugar do not seem to have a strong correlation with each other.

Residual sugar has a strong positive correlation with the density of the wines (r ≈ 0.84).

Citric acid and volatile acidity appear to be only weakly correlated with each other (r ≈ -0.15).

A fairly strong positive correlation exists between free sulfur dioxide and total sulfur dioxide (r ≈ 0.62).

Fixed acidity and volatile acidity do not appear to have a strong correlation with each other.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

There is a strong negative correlation between alcohol and density of the wines (r ≈ -0.78). Alcohol content also has a moderate positive relationship with the quality of wine (r ≈ 0.44). Wines of quality 9 have a smaller range in alcohol content than wines at the other quality levels and also have the highest median alcohol content. It is, however, interesting to note that the highest alcohol content of the wines occurs at quality 7. The pH does not seem to have any strong correlation with the alcohol content of the wines (r ≈ 0.12), which is contrary to what I expected.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Fixed acidity and volatile acidity do not appear to be correlated with each other (r ≈ -0.02), which is counter to what I expected. The total sulfur dioxide is positively correlated with the free sulfur dioxide in the wines (r ≈ 0.62).

What was the strongest relationship you found?

The strongest positive relationship is between density and residual sugar of the wines (r ≈ 0.84), while the strongest negative relationship is between alcohol and density (r ≈ -0.78). Looking at the plot, there is one extreme outlier that could be influencing the overall relationship.
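The summary statistics reported earlier are consistent with at least one extreme value in both variables. As a rough check, the sketch below applies the common 1.5 × IQR rule to the quartiles and maxima printed in the univariate section. The helper name `upper_fence` is hypothetical, and the check says nothing about whether the two maxima belong to the same wine.

```r
# Upper Tukey fence: third quartile plus 1.5 times the interquartile range.
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

# Quartiles and maxima copied from the summary() output earlier in the report.
density_fence <- upper_fence(q1 = 0.9917, q3 = 0.9961)
sugar_fence   <- upper_fence(q1 = 1.700,  q3 = 9.900)

density_max <- 1.0390
sugar_max   <- 65.800

# TRUE means the maximum lies beyond the fence and would be flagged.
density_flagged <- density_max > density_fence
sugar_flagged   <- sugar_max > sugar_fence

print(c(density_fence = density_fence, sugar_fence = sugar_fence))
print(c(density_flagged = density_flagged, sugar_flagged = sugar_flagged))
```

Refitting the correlation with and without the flagged observation would show how much it actually influences the relationship.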

Multivariate Plots Section

I want to see how acidity in general affects the alcohol content of the wine. I will compare fixed.acidity, volatile.acidity, citric.acid, pH and alcohol.

The two types of acidity (fixed and volatile) show little correlation with the alcohol content of the wines (r ≈ -0.12 and 0.07, respectively). pH appears to be some sort of average, since it sits at the midpoint of the two acid types.

Next I make some plots to investigate how density, alcohol and pH relate to each other.

The strongest relationship found in the correlation matrix was between residual.sugar and density. Alcohol is negatively correlated with residual.sugar (r ≈ -0.45), total.sulfur.dioxide (r ≈ -0.45) and, most strongly, density (r ≈ -0.78). Let’s fit a linear model for alcohol using these three variables:

## 
## Call:
## lm(formula = alcohol ~ residual.sugar + total.sulfur.dioxide + 
##     density, data = wqdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9956 -0.4163 -0.0501  0.3534 16.2984 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.626e+02  5.808e+00  96.860   <2e-16 ***
## residual.sugar        1.668e-01  3.207e-03  51.998   <2e-16 ***
## total.sulfur.dioxide -2.368e-04  2.456e-04  -0.964    0.335    
## density              -5.565e+02  5.873e+00 -94.745   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6167 on 4894 degrees of freedom
## Multiple R-squared:  0.749,  Adjusted R-squared:  0.7489 
## F-statistic:  4869 on 3 and 4894 DF,  p-value: < 2.2e-16
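The summary figures in this output are tied together by standard formulas, which gives a quick consistency check. The sketch below recomputes the adjusted R-squared, the F-statistic and one t value from the rounded numbers printed above, so small differences from the printed values are expected.

```r
n  <- 4898    # observations
k  <- 3       # predictors in the model
r2 <- 0.749   # Multiple R-squared as printed (rounded)

# Residual degrees of freedom: n minus predictors minus the intercept.
df_resid <- n - k - 1

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# A t value is the estimate divided by its standard error (density row).
t_density <- -5.565e+02 / 5.873e+00

print(c(df_resid = df_resid, adj_r2 = adj_r2,
        f_stat = f_stat, t_density = t_density))
```

The recomputed values land close to the printed 4894 degrees of freedom, adjusted R-squared of 0.7489, F-statistic of 4869 and t value of -94.745.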

Wines with a quality of 9 appear to have a smaller range in alcohol/pH and also have the highest median of alcohol/pH.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Wines that have a quality of 9 have the smallest range in alcohol/pH and also have the highest median alcohol/pH of all the wines.

Were there any interesting or surprising interactions between features?

The factors that influence the pH do not appear to have any correlation with the level of alcohol in the wines. Only citric.acid seems to have any correlation with pH and the level of alcohol in the wines.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I made a model to estimate the alcohol level of the wines using the residual sugar, total sulfur dioxide and density. The variables account for 74.9% of the variability in the alcohol in the wine. However, total.sulfur.dioxide is not a significant predictor in this model (p = 0.335), and the largest residual (16.3) points to at least one wine that the model fits poorly.
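To get a feel for what the fitted model implies, the printed coefficients can be applied by hand. The sketch below uses the rounded estimates from the model output and the first five wines as printed in the `str()` output. The function name `predict_alcohol` is hypothetical. Because both the coefficients and the density values are rounded, and the density coefficient is large, the predictions are only approximate.

```r
# Coefficient estimates copied (rounded) from the lm() summary.
coefs <- c(intercept            =  5.626e+02,
           residual.sugar       =  1.668e-01,
           total.sulfur.dioxide = -2.368e-04,
           density              = -5.565e+02)

# Linear predictor: intercept plus coefficient times value for each term.
predict_alcohol <- function(residual.sugar, total.sulfur.dioxide, density) {
  coefs[["intercept"]] +
    coefs[["residual.sugar"]]       * residual.sugar +
    coefs[["total.sulfur.dioxide"]] * total.sulfur.dioxide +
    coefs[["density"]]              * density
}

# First five wines as printed by str(); density is rounded to 3 decimals.
first <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

first$predicted <- with(first, predict_alcohol(residual.sugar,
                                               total.sulfur.dioxide,
                                               density))
first$error <- first$predicted - first$alcohol

print(round(first, 3))
```

With the fitted model object available, `predict()` on that object would avoid the rounding error introduced here.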

Final Plots and Summary

Plot One

Description One

The density of wines is negatively correlated with the alcohol content of the wines. There is an interesting outlier that has high density but only slightly above average alcohol content. Most of the points are clustered, so this outlier does not affect the average that much.

Plot Two

Description Two

Wines with a quality of 9 seem to have two prominent peaks in density as alcohol increases, with the density otherwise at zero, while the other qualities have a more gradual distribution in the density.

Plot Three

Description Three

Wines with a quality of 9 have the highest median alcohol/pH concentration but also appear to have the smallest range in alcohol/pH.


Reflection

In this analysis, I looked at a dataset on white wine quality. This dataset comes from work done by Cortez et al., 2009. There are 4898 observations in this dataset and 13 variables. I started by understanding the individual variables in the dataset and then explored the interesting relationships between these variables by making plots. Eventually I made a linear model to estimate the alcohol content of wine using variables that were correlated with it.

There was a negative correlation between the quality of wine and the density of wine (r ≈ -0.31). I was surprised that fixed acidity and volatile acidity did not appear to be correlated with each other. I struggled to make sense of this, since I expected the two acidity measures to be related.

Some limitations of this model come from the source data. It doesn’t account for the seasonality of wine or the regions where the grapes were grown. Adding these variables would have made the dataset more informative. Future work should include these variables in the analysis.